protein family
Protein Language Model Zero-Shot Fitness Predictions are Improved by Inference-only Dropout
Ravuri, Aditya, Lawrence, Neil D.
Protein Language Models (PLMs) such as ESM2 (Lin et al., 2023) have been shown to be capable of zero-shot prediction of critical scalar properties of proteins ("fitness", Meier et al. (2021)). In this work, we show that injecting a dropout layer at inference time between a PLM's featurizer/embedding layer and its transformer, and averaging the resulting outputs, akin to Monte-Carlo dropout (Gal & Ghahramani, 2016), increases zero-shot performance on a subset of the ProteinGym dataset (Notin et al., 2023). This holds even when the model was not trained with dropout to begin with, and requires no retraining or finetuning of the PLM. A dropout rate of 0.1 seems performant across all models.
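The inference-only Monte-Carlo dropout idea described above can be sketched in a few lines. This is a toy illustration, not the paper's implementation: `EMBED`, `scorer`, and `mc_dropout_score` are hypothetical stand-ins for a frozen embedding table and a frozen downstream network.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, DIM = 20, 8
EMBED = rng.normal(size=(VOCAB, DIM))   # frozen embedding table (stand-in)
W = rng.normal(size=(DIM,))             # frozen linear "model" (stand-in)

def scorer(h):
    # stand-in for the frozen transformer + head: mean-pooled linear score
    return float(h.mean(axis=0) @ W)

def mc_dropout_score(tokens, p=0.1, n_samples=32, seed=1):
    """Apply dropout between the embedding and the model at inference,
    then average the resulting scores (akin to Gal & Ghahramani, 2016)."""
    rng = np.random.default_rng(seed)
    h = EMBED[tokens]                   # featurizer/embedding output
    scores = []
    for _ in range(n_samples):
        mask = rng.random(h.shape) >= p  # drop units with probability p
        h_drop = h * mask / (1.0 - p)    # inverted-dropout rescaling
        scores.append(scorer(h_drop))
    return float(np.mean(scores))

seq = np.array([3, 7, 1, 12, 5])
plain = scorer(EMBED[seq])              # ordinary deterministic score
mc = mc_dropout_score(seq, p=0.1)       # averaged stochastic score
```

Note that no weights change: the only modification is the stochastic mask inserted between the featurizer output and the rest of the (frozen) network, with results averaged over samples.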
PoET: A generative model of protein families as sequences-of-sequences
Generative protein language models are a natural way to design new proteins with desired functions. However, current models are either difficult to direct to produce a protein from a specific family of interest, or must be trained on a large multiple sequence alignment (MSA) from the specific family of interest, making them unable to benefit from transfer learning across families. To address this, we propose Protein Evolutionary Transformer (PoET), an autoregressive generative model of whole protein families that learns to generate sets of related proteins as sequences-of-sequences across tens of millions of natural protein sequence clusters. PoET can be used as a retrieval-augmented language model to generate and score arbitrary modifications conditioned on any protein family of interest, and can extrapolate from short context lengths to generalize well even for small families. This is enabled by a unique Transformer layer: we model tokens sequentially within sequences while attending between sequences in an order-invariant manner, allowing PoET to scale to context lengths beyond those used during training.
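The sequences-of-sequences interface can be illustrated with a toy scorer: family members are flattened into one token stream with a separator, and a query variant is scored conditioned on that context. The smoothed bigram model below is purely a stand-in for PoET's autoregressive Transformer; `pack_family` and `bigram_logprob` are illustrative names, not the paper's API.

```python
import math

SEP = "|"

def pack_family(context_seqs, query):
    # Context sequences first (in PoET their order should not matter),
    # then the query whose tokens are scored autoregressively.
    return SEP.join(context_seqs) + SEP + query

def bigram_logprob(stream, query):
    # Toy conditional score: add-one-smoothed bigram counts from the
    # context portion of the stream, used to score the query tokens.
    context = stream[: len(stream) - len(query)]
    counts, totals = {}, {}
    for a, b in zip(context, context[1:]):
        counts[(a, b)] = counts.get((a, b), 0) + 1
        totals[a] = totals.get(a, 0) + 1
    lp, prev = 0.0, SEP
    for ch in query:
        num = counts.get((prev, ch), 0) + 1   # add-one smoothing
        den = totals.get(prev, 0) + 21        # 20 amino acids + separator
        lp += math.log(num / den)
        prev = ch
    return lp

family = ["MKTAY", "MKTAF", "MKSAY"]          # retrieved family members
score_in = bigram_logprob(pack_family(family, "MKTAY"), "MKTAY")
score_out = bigram_logprob(pack_family(family, "WWWWW"), "WWWWW")
```

Even this crude stand-in assigns a higher conditional likelihood to a family-consistent variant than to an unrelated sequence, which is the behavior the retrieval-augmented scoring setup relies on.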
A Fusion-Driven Approach of Attention-Based CNN-BiLSTM for Protein Family Classification -- ProFamNet
Ali, Bahar, Shah, Anwar, Niaz, Malik, Mansoord, Musadaq, Ullah, Sami, Adnan, Muhammad
Advanced automated AI techniques allow us to classify protein sequences and discern their biological families and functions. Conventional approaches for classifying these protein families often focus on extracting N-Gram features from the sequences while overlooking crucial motif information and the interplay between motifs and neighboring amino acids. Recently, convolutional neural networks have been applied to amino acid and motif data, even with a limited dataset of well-characterized proteins, resulting in improved performance. This study presents a model for classifying protein families using the fusion of 1D-CNN, BiLSTM, and an attention mechanism, which combines spatial feature extraction, long-term dependencies, and context-aware representations. The proposed model (ProFamNet) achieved superior model efficiency with 450,953 parameters and a compact size of 1.72 MB, outperforming the state-of-the-art model with 4,578,911 parameters and a size of 17.47 MB. Further, we achieved a higher F1 score (98.30% vs. 97.67%) with more instances (271,160 vs. 55,077) in fewer training epochs (25 vs. 30).
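The fusion architecture described above (1D-CNN for local motif features, a bidirectional recurrent pass for long-range dependencies, attention pooling for context-aware summaries) can be sketched as a forward pass in NumPy. This is a minimal shape-level illustration with random weights, not ProFamNet itself; a plain tanh RNN stands in for the LSTM cell, and all dimensions are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
L, V, C, H = 50, 20, 16, 8   # sequence length, vocab, conv channels, RNN size

def conv1d(x, W):
    # x: (L, V) one-hot inputs, W: (k, V, C) kernels -> (L-k+1, C) features
    k = W.shape[0]
    return np.stack([np.tensordot(x[i:i + k], W, axes=([0, 1], [0, 1]))
                     for i in range(x.shape[0] - k + 1)])

def rnn(h, Wx, Wh):
    # plain tanh RNN as a simplified stand-in for an LSTM cell
    s, out = np.zeros(Wh.shape[0]), []
    for t in range(h.shape[0]):
        s = np.tanh(h[t] @ Wx + s @ Wh)
        out.append(s)
    return np.array(out)

def attention_pool(h, v):
    # score each position, softmax-normalize, take the weighted sum
    a = np.exp(h @ v)
    a /= a.sum()
    return a @ h

# random weights for illustration only
Wc = rng.normal(size=(5, V, C)) * 0.1
Wx_f, Wh_f = rng.normal(size=(C, H)) * 0.1, rng.normal(size=(H, H)) * 0.1
Wx_b, Wh_b = rng.normal(size=(C, H)) * 0.1, rng.normal(size=(H, H)) * 0.1
v = rng.normal(size=(2 * H,))

x = np.eye(V)[rng.integers(0, V, size=L)]    # one-hot amino acid sequence
feats = np.maximum(conv1d(x, Wc), 0)         # 1D-CNN + ReLU: motif features
fwd = rnn(feats, Wx_f, Wh_f)                 # forward recurrent pass
bwd = rnn(feats[::-1], Wx_b, Wh_b)[::-1]     # backward recurrent pass
h = np.concatenate([fwd, bwd], axis=1)       # "BiLSTM"-style features
pooled = attention_pool(h, v)                # context-aware summary vector
```

In the real model the pooled vector would feed a dense softmax layer over family labels; the point here is only how the three components compose.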
Towards a more inductive world for drug repurposing approaches
de la Fuente, Jesus, Serrano, Guillermo, Veleiro, Uxía, Casals, Mikel, Vera, Laura, Pizurica, Marija, Pineda-Lucena, Antonio, Ochoa, Idoia, Vicent, Silve, Gevaert, Olivier, Hernaez, Mikel
Drug-target interaction (DTI) prediction is a challenging, albeit essential, task in drug repurposing. Graph-based learning models have drawn special attention as they can significantly reduce drug repurposing costs and time commitment. However, many current approaches require demanding additional information besides DTIs, which complicates their evaluation process and usability. Additionally, structural differences in the learning architectures of current models hinder their fair benchmarking. In this work, we first perform an in-depth evaluation of current DTI datasets and prediction models through a robust benchmarking process, and show that DTI prediction methods based on transductive models lack generalization and lead to inflated performance when evaluated as previously done in the literature, hence not being suited for drug repurposing approaches. We then propose a novel biologically-driven strategy for negative edge subsampling and show through in vitro validation that newly discovered interactions are indeed true. We envision this work as the underpinning for future fair benchmarking and robust model design. All generated resources and tools are publicly available as a Python package.
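The transductive-versus-inductive distinction that drives the inflated-performance finding can be made concrete with a toy edge split. This is a hedged sketch, not the paper's protocol: the edge list and function names are illustrative, and the inductive split here holds out drugs (one could equally hold out proteins or both).

```python
import random

# toy bipartite DTI edge list (drug, protein); contents are arbitrary
edges = [(f"drug{i}", f"prot{j}") for i in range(10) for j in range(5)
         if (i + j) % 3 != 0]

def transductive_split(edges, frac=0.2, seed=0):
    # random edge split: test edges reuse nodes already seen in training,
    # so a model can memorize node-level signal
    rnd = random.Random(seed)
    shuffled = edges[:]
    rnd.shuffle(shuffled)
    k = int(len(shuffled) * frac)
    return shuffled[k:], shuffled[:k]

def inductive_split(edges, frac=0.2, seed=0):
    # hold out entire drugs: no test drug appears in any training edge,
    # matching the repurposing setting of scoring an unseen compound
    rnd = random.Random(seed)
    drugs = sorted({d for d, _ in edges})
    rnd.shuffle(drugs)
    held = set(drugs[: int(len(drugs) * frac)])
    train = [e for e in edges if e[0] not in held]
    test = [e for e in edges if e[0] in held]
    return train, test

tr_t, te_t = transductive_split(edges)
tr_i, te_i = inductive_split(edges)
seen_t = {d for d, _ in tr_t} & {d for d, _ in te_t}  # shared drugs
seen_i = {d for d, _ in tr_i} & {d for d, _ in te_i}  # empty by design
```

Under the transductive split the test drugs overlap with training drugs, which is exactly the leakage path that inflates reported performance; the inductive split forces genuine generalization to unseen compounds.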
Pre-Training on Large-Scale Generated Docking Conformations with HelixDock to Unlock the Potential of Protein-ligand Structure Prediction Models
Liu, Lihang, He, Donglong, Ye, Xianbin, Zhou, Jingbo, Zhang, Shanzhuo, Zhang, Xiaonan, Li, Jun, Chai, Hua, Wang, Fan, He, Jingzhou, Zheng, Liang, Li, Yonghui, Fang, Xiaomin
Protein-ligand structure prediction is an essential task in drug discovery, predicting the binding interactions between small molecules (ligands) and target proteins (receptors). Although conventional physics-based docking tools are widely utilized, their accuracy is compromised by limited conformational sampling and imprecise scoring functions. Recent advances have incorporated deep learning techniques to improve the accuracy of structure prediction. Nevertheless, the experimental validation of docking conformations remains costly, which raises concerns regarding the generalizability of these deep learning-based methods due to the limited training data. In this work, we show that by pre-training a geometry-aware SE(3)-equivariant neural network on large-scale docking conformations generated by traditional physics-based docking tools and then fine-tuning with a limited set of experimentally validated receptor-ligand complexes, we can achieve outstanding performance. This process involved the generation of 100 million docking conformations, consuming roughly 1 million CPU core days. The proposed model, HelixDock, aims to acquire the physical knowledge encapsulated by the physics-based docking tools during the pre-training phase. HelixDock has been benchmarked against both physics-based and deep learning-based baselines, showing that it outperforms its closest competitor by over 40% in RMSD. HelixDock also exhibits enhanced performance on a dataset that poses a greater challenge, thereby highlighting its robustness. Moreover, our investigation reveals the scaling laws governing pre-trained structure prediction models, indicating a consistent enhancement in performance with increases in model parameters and pre-training data. This study illuminates the strategic advantage of leveraging a vast and varied repository of generated data to advance the frontiers of AI-driven drug discovery.
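The two-phase recipe (pre-train on abundant, noisier labels from a physics-based generator, then fine-tune the same weights on a small experimentally validated set) can be sketched with a toy regression task. Linear regression and plain gradient descent stand in for the SE(3)-equivariant network and its optimizer; all data here is synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
true_w = rng.normal(size=5)              # hidden "physics" the model must learn

def make_data(n, noise):
    X = rng.normal(size=(n, 5))
    return X, X @ true_w + rng.normal(scale=noise, size=n)

def sgd(w, X, y, lr, steps):
    # full-batch gradient descent on mean squared error
    for _ in range(steps):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w = w - lr * grad
    return w

# phase 1: many cheap, noisy labels (analogous to generated docking poses)
X_big, y_big = make_data(5000, noise=0.5)
# phase 2: few accurate labels (analogous to experimental complexes)
X_exp, y_exp = make_data(20, noise=0.05)
# held-out evaluation set
X_test, y_test = make_data(200, noise=0.05)

w = np.zeros(5)
w = sgd(w, X_big, y_big, lr=0.05, steps=200)   # pre-training phase
w = sgd(w, X_exp, y_exp, lr=0.05, steps=50)    # fine-tuning phase
test_mse = float(np.mean((X_test @ w - y_test) ** 2))
```

The pre-training phase does the heavy lifting of recovering the underlying signal from abundant imperfect data, leaving fine-tuning to make a small correction from the scarce accurate labels, which mirrors the division of labor the abstract describes.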